Development of Synthetic Microdata for Educational Use in Japan
Abstract
Japan’s new Statistics Act came fully into effect in April 2009. The new law allows access to Anonymized microdata, but it requires users to go through an application process and imposes some restrictions on use. The National Statistics Center (NSTAC) has developed a type of microdata that can be accessed without an application process and used without restrictions. These data do not contain original microdata but consist of Synthetic microdata. The absence of an application process and of usage restrictions makes Synthetic microdata particularly suitable for educational use. This paper outlines the process for creating Synthetic microdata for educational use based on multi-dimensional tables derived from original microdata, and compares the characteristics of the Synthetic microdata with those of the original microdata.

IASE/IAOS Satellite 2013 Contributed paper Makita et al. In S. Forbes and B. Phillips (Eds.) Proceedings of the Joint IASE/IAOS Satellite Conference, Macao, China. August 2013.

1. BACKGROUND: THE LEGAL FRAMEWORK IN JAPAN

Japan’s new Statistics Act came fully into effect in April 2009 and allows the provision of Anonymized microdata (Article 36) and tailor-made tabulations (Article 34) for scientific purposes. Anonymized microdata are defined as "questionnaire information that is processed so that no particular individuals or juridical persons, or other organizations shall be identified." Because Anonymized microdata are created using disclosure limitation methods and therefore differ from the original microdata, they allow wider use of official microdata, including in higher education and academic research. The new Statistics Act has thus expanded the role of statistics in education and research in Japan.

The National Statistics Center (NSTAC) compiles statistical tables for the surveys conducted by Japanese Ministries and Agencies such as the Statistics Bureau of Japan (SBJ). In addition, based on the new Statistics Act and a Cabinet order, NSTAC today plays a key role in the new framework for statistics education and research: it operates a data archive that provides Anonymized microdata and tailor-made tabulations for data collected by government offices and ministries, and it cooperates with academic research organizations to promote these services.

To access Anonymized microdata, the new Statistics Act requires users to apply for permission, imposes some conditions on data usage and storage, and requires payment of access costs of approximately 10,000 JPY (100 USD) per file. These factors make the data less attractive for education and training. To provide an alternative that offers easier access, NSTAC has developed Synthetic microdata for education and training that do not stem from Anonymized microdata and that can be accessed without an application process and used without restrictions.

2. SYNTHETIC MICRODATA FOR EDUCATIONAL USE

This paper describes how these Synthetic microdata for educational use were created using multi-dimensional tabulation of original microdata from the 2004 ‘National Survey of Family Income and Expenditure’ conducted by the Statistics Bureau of Japan. Since the Synthetic microdata stem from the original microdata only indirectly, they are free from the application process and the restrictions that apply to Anonymized microdata.

Specifically, the Synthetic microdata were created by using microaggregation, which is one of the disclosure limitation methods adopted for microdata of official statistics. Microaggregation is characterized by (1) the creation of records with common values for all types of qualitative attributes based on multi-dimensional tabulation and (2) the sorting and dividing of records with common values for qualitative attributes into groups larger than a specific minimum size.
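The minimum-group-size idea can be illustrated with a minimal sketch. It is not NSTAC's actual procedure: the function name `suppress_small_cells`, the attribute names, and the toy records are hypothetical, and the sketch assumes that a small cell is handled by recoding the remaining qualitative attributes to an 'unknown' code, which the paper denotes V in Section 3. Only a single pass is made, so a merged cell could in principle still fall below the minimum size.

```python
from collections import Counter

MIN_GROUP_SIZE = 3   # minimum group size stated in the paper
UNKNOWN = "V"        # code the paper uses for 'unknown'


def suppress_small_cells(records, keep_attrs, other_attrs, min_size=MIN_GROUP_SIZE):
    """Return copies of the records with small cells recoded.

    A cell is one combination of values over keep_attrs + other_attrs.
    Records in cells smaller than min_size get other_attrs set to UNKNOWN.
    Single pass only (an assumption of this sketch).
    """
    all_attrs = list(keep_attrs) + list(other_attrs)

    def cell_of(record):
        return tuple(record[a] for a in all_attrs)

    sizes = Counter(cell_of(r) for r in records)

    result = []
    for original in records:
        record = dict(original)  # do not mutate the caller's data
        if sizes[cell_of(original)] < min_size:
            for attr in other_attrs:
                record[attr] = UNKNOWN
        result.append(record)
    return result


# Hypothetical toy records, not survey data.
records = [
    {"gender": "M", "employment": "regular"},
    {"gender": "M", "employment": "regular"},
    {"gender": "M", "employment": "regular"},
    {"gender": "M", "employment": "part-time"},
    {"gender": "M", "employment": "self-employed"},
    {"gender": "M", "employment": "unemployed"},
]

recoded = suppress_small_cells(records, keep_attrs=["gender"], other_attrs=["employment"])
cell_sizes = Counter((r["gender"], r["employment"]) for r in recoded)
```

In this toy case the three singleton employment categories are merged into one cell with gender kept and employment status set to V, so every remaining cell holds at least three records.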
For the Synthetic microdata to achieve distributions that approximately replicate those of the original microdata, multivariate normal random numbers that replicate the mean, variance and covariance of the original microdata were used, based on the assumption that records are normally (or, in the case of monetary amounts etc., log-normally) distributed within each cell of the multi-dimensional tables. The Synthetic microdata created in this research have about 30,000 records.

Footnote 1: For a more detailed explanation see Ito (2009) and Ito and Takano (2011).

3. CREATING SYNTHETIC MICRODATA

In outline, the Synthetic microdata were created in three steps. First, the quantitative and qualitative attributes to be contained in the Synthetic microdata were selected. Second, records with common values for qualitative attributes were sorted into groups with a minimum size of 3. Third, tables were created in order to generate multivariate lognormal random numbers and to generate records for which the values of some quantitative attributes are 0. This process allows the creation of Synthetic microdata with characteristics similar to those of the original microdata. The detailed process is as follows:

(1) Qualitative attributes were selected from the multi-dimensional statistical tables compiled from the original microdata. Specifically, 14 qualitative attributes were selected based on the survey items that researchers use most frequently, including gender, age and employment status. In addition, 184 quantitative attributes were selected, including Yearly Household Income and Monthly Household Expenditures.

(2) Records with common values for qualitative attributes were sorted into groups with a minimum size of 3. For records that have common values for some qualitative attributes and that belong to groups with a size of 1 or 2, the values of the other qualitative attributes were transformed to ‘unknown’ (V) in order to create groups with a minimum size of 3. Figure 1 illustrates this process in the case of gender and employment status.

(3) Two types of tables were created in order to generate 1) multivariate lognormal random numbers and 2) records in which the values of some quantitative attributes are 0. Tables of ‘Type 1’ contain the frequency, mean, variance and covariance of quantitative attributes, excluding values of 0. The records on which these tables are based were classified by qualitative attributes in order to generate multivariate lognormal random numbers. Tables of ‘Type 2’ were created by sorting records according to whether the values of quantitative attributes are 0 or not 0; on this basis, the values of some quantitative attributes in the records were transformed to 0.

Figure 2 illustrates the creation of the Synthetic microdata and compares the frequency of the Synthetic microdata with that of the original microdata. To create the Synthetic microdata, a logarithmic transformation was first applied to the original microdata. The two types of tables described above were then used to generate multivariate lognormal random numbers and to transform the values of some quantitative attributes to 0. Lastly, an exponential transformation was applied.

4. COMPARISON OF ORIGINAL MICRODATA AND SYNTHETIC MICRODATA

To establish the usability of the Synthetic microdata, their characteristics were compared with those of the original microdata. Table 1 compares the average values of the two microdata for several quantitative attributes, such as Yearly Income, Receipts, Income, Receipts other than Income, Expenditure, Living Expenditure and Non-Living Expenditure.
The difference between the two microdata was calculated by subtracting the average of each attribute in the original microdata from the corresponding average in the Synthetic microdata and dividing the result by the average in the original microdata, that is, (Synthetic − original) / original. The results show that the averages of the attributes contained in the Synthetic microdata are quite similar to those in the original microdata.

Table 2 compares the standard deviations of several quantitative attributes in the original microdata and the Synthetic microdata. The difference between the two was calculated in the same way: the standard deviation in the original microdata was subtracted from that in the Synthetic microdata, and the result was divided by the standard deviation in the original microdata. The results show that the standard deviation for the Synthetic microdata is similar to that for the original microdata.

Figure 3 shows histograms of ‘Receipts other than Income’ for the Synthetic microdata and the original microdata. The two histograms are similar to each other. Figure 4 shows scatter diagrams of Yearly Income and Non-Living Expenditure for both the Synthetic microdata and the original microdata. The scatter diagrams resemble each other, although the Synthetic microdata have more outliers than the original microdata because of the influence of the multivariate lognormal random numbers on the frequency of the Synthetic microdata.

Table 3 contains correlation matrices calculated for several attributes of the records contained in the original microdata and the Synthetic microdata. By and large, the correlation matrices appear similar. The results therefore indicate that the relationships between attributes in the original microdata are maintained in the Synthetic microdata.
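The generation step of Section 3 and the relative-difference measure of Section 4 can be sketched for a single table cell as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the cell summaries (`log_mean`, `log_cov`, `zero_share`) and all numeric values are hypothetical. It is also an assumption of this sketch that zeros are assigned independently per attribute with a fixed share; the paper does not specify how the ‘Type 2’ tables are applied.

```python
import math
import random
import statistics


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def draw_cell(log_mean, log_cov, zero_share, n, rng):
    """Draw n synthetic records for one cell.

    Logs of the values are multivariate normal(log_mean, log_cov); the
    exponential transformation maps them back to the original scale.
    Attribute i is then set to 0 with probability zero_share[i]
    (an assumption of this sketch).
    """
    lower = cholesky(log_cov)
    dim = len(log_mean)
    records = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        record = []
        for i in range(dim):
            log_value = log_mean[i] + sum(lower[i][k] * z[k] for k in range(i + 1))
            value = math.exp(log_value)
            if rng.random() < zero_share[i]:
                value = 0.0
            record.append(value)
        records.append(record)
    return records


def relative_difference(synthetic_value, original_value):
    """(synthetic - original) / original, the measure used for Tables 1 and 2."""
    return (synthetic_value - original_value) / original_value


# Hypothetical cell summaries on the log scale (made-up numbers).
log_mean = [math.log(600.0), math.log(30.0)]
log_cov = [[0.25, 0.10], [0.10, 0.16]]
zero_share = [0.0, 0.2]  # the second attribute is 0 for about 20% of records

rng = random.Random(2013)  # fixed seed so the sketch is reproducible
synthetic = draw_cell(log_mean, log_cov, zero_share, n=2000, rng=rng)

# Compare the synthetic mean of attribute 0 with the mean implied by the
# lognormal parameters, exp(mu + sigma^2 / 2).
expected_mean = math.exp(log_mean[0] + log_cov[0][0] / 2)
synthetic_mean = statistics.fmean(r[0] for r in synthetic)
diff = relative_difference(synthetic_mean, expected_mean)
```

Because the lognormal distribution has a long right tail, a sketch like this occasionally produces very large values, which is consistent with the paper's remark that the Synthetic microdata contain more outliers than the original microdata.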